5420 Anomaly Detection | Assignment 5 - Joyce Ng (jn2901)¶
Credit Card Fraud Detection Modeling¶
The Office of Management and Enterprise Services in the State of Oklahoma has made its purchase credit card transactions available. The dataset contains information on purchases made through the purchase card programs administered by the state and its higher education institutions. This analysis applies the Histogram-Based Outlier Score (HBOS) and Empirical Cumulative Distribution-based Outlier Detection (ECOD) algorithms to identify anomalous transactions.
Table of Contents¶
- Section 1: Data Preparation
- Section 2: EDA
- Section 3: Feature Engineering
- Section 4: Histogram-Based Outlier Score (HBOS)
- Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD)
- Section 6: Model Prediction Comparison
- Section 7: Conclusion
Section 1: Data Preparation ¶
1.1 Load Libraries and Dataset ¶
# Load libraries
import pandas as pd
import numpy as np
from functools import reduce
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve,roc_auc_score, confusion_matrix, f1_score, accuracy_score, make_scorer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from pyod.utils.data import generate_data
from pyod.models.hbos import HBOS
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.models.ecod import ECOD
# Visualization
import plotly.express as px
import plotly.graph_objs as go
import plotly.subplots as sp
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import plotly.io as pio
from IPython.display import display
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
df = pd.read_csv('/Users/Joyce630/Desktop/Columbia/5420 Anomaly Detection/Assignments/2 - Credit Card/purchase_credit_card.csv')
df.head()
|   | Year-Month | Agency Number | Agency Name | Cardholder Last Name | Cardholder First Initial | Description | Amount | Vendor | Transaction Date | Posted Date | Merchant Category Code (MCC) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | GENERAL PURCHASE | 890.00 | NACAS | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS |
| 1 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | ROOM CHARGES | 368.96 | SHERATON HOTEL | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | SHERATON |
| 2 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | J | GENERAL PURCHASE | 165.82 | SEARS.COM 9300 | 07/29/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... |
| 3 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | T | GENERAL PURCHASE | 96.39 | WAL-MART #0137 | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | GROCERY STORES,AND SUPERMARKETS |
| 4 | 201307 | 1000 | OKLAHOMA STATE UNIVERSITY | Mauro-Herrera | M | HAMMERMILL COPY PLUS COPY EA | 125.96 | STAPLES DIRECT | 07/30/2013 12:00:00 AM | 07/31/2013 12:00:00 AM | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... |
# Check the dimensions of the dataset: it has 442,458 rows and 11 columns
print(df.shape, "\n") # Check the shape of df
print(df.columns, "\n") # Check column names
print(df.info(), "\n") # Check info of the df
df.describe() # Get simple summary statistics
(442458, 11)
Index(['Year-Month', 'Agency Number', 'Agency Name', 'Cardholder Last Name',
'Cardholder First Initial', 'Description', 'Amount', 'Vendor',
'Transaction Date', 'Posted Date', 'Merchant Category Code (MCC)'],
dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year-Month 442458 non-null int64
1 Agency Number 442458 non-null int64
2 Agency Name 442458 non-null object
3 Cardholder Last Name 442458 non-null object
4 Cardholder First Initial 442458 non-null object
5 Description 442458 non-null object
6 Amount 442458 non-null float64
7 Vendor 442458 non-null object
8 Transaction Date 442458 non-null object
9 Posted Date 442458 non-null object
10 Merchant Category Code (MCC) 442458 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 37.1+ MB
None
|   | Year-Month | Agency Number | Amount |
|---|---|---|---|
| count | 442458.000000 | 442458.000000 | 4.424580e+05 |
| mean | 201357.284375 | 42785.860353 | 4.249912e+02 |
| std | 47.107417 | 33378.461293 | 5.266509e+03 |
| min | 201307.000000 | 1000.000000 | -4.286304e+04 |
| 25% | 201309.000000 | 1000.000000 | 3.091000e+01 |
| 50% | 201401.000000 | 47700.000000 | 1.048900e+02 |
| 75% | 201404.000000 | 76000.000000 | 3.450000e+02 |
| max | 201406.000000 | 98000.000000 | 1.903858e+06 |
1.2: Check Missing Values ¶
# Check for missing values
missing_values = df.isnull().sum()
missing_values
Year-Month                      0
Agency Number                   0
Agency Name                     0
Cardholder Last Name            0
Cardholder First Initial        0
Description                     0
Amount                          0
Vendor                          0
Transaction Date                0
Posted Date                     0
Merchant Category Code (MCC)    0
dtype: int64
1.3: Rename Columns, Convert Data Types, and Create Date Fields ¶
# Change column names
df.columns = ['Year_Month', 'Agency_Number', 'Agency_Name', 'Cardholder_Last_Name',
'Cardholder_First_Initial', 'Description', 'Amount', 'Vendor', 'Transaction_Date',
'Posted_Date', 'Merchant_Category']
# Convert Transaction_Date and Posted_Date from object dtype to datetime, then derive Year and Month columns for feature engineering
df['Transaction_Date']=pd.to_datetime(df['Transaction_Date'])
df['Posted_Date']=pd.to_datetime(df['Posted_Date'])
df['Year_Month'].dtype # Check data type for information purpose
df['Year_Month']=pd.to_datetime(df['Year_Month'], format='%Y%m')
df['Year'] = df['Year_Month'].dt.year
df['Month'] = df['Year_Month'].dt.month
# Create 'Week_Number' from 'Transaction_Date'
df['Week_Number'] = df['Transaction_Date'].dt.isocalendar().week
# Create 'Day_of_Week' from 'Transaction_Date'
dayOfWeek={0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['Day_of_Week'] = df['Transaction_Date'].dt.dayofweek.map(dayOfWeek)
df['Day_of_Week'].value_counts()
Day_of_Week
Wednesday    89644
Thursday     87537
Tuesday      86919
Friday       79917
Monday       61809
Saturday     26307
Sunday       10325
Name: count, dtype: int64
df.head()
|   | Year_Month | Agency_Number | Agency_Name | Cardholder_Last_Name | Cardholder_First_Initial | Description | Amount | Vendor | Transaction_Date | Posted_Date | Merchant_Category | Year | Month | Week_Number | Day_of_Week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | GENERAL PURCHASE | 890.00 | NACAS | 2013-07-30 | 2013-07-31 | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 2013 | 7 | 31 | Tuesday |
| 1 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mason | C | ROOM CHARGES | 368.96 | SHERATON HOTEL | 2013-07-30 | 2013-07-31 | SHERATON | 2013 | 7 | 31 | Tuesday |
| 2 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | J | GENERAL PURCHASE | 165.82 | SEARS.COM 9300 | 2013-07-29 | 2013-07-31 | DIRCT MARKETING/DIRCT MARKETERS--NOT ELSEWHERE... | 2013 | 7 | 31 | Monday |
| 3 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Massey | T | GENERAL PURCHASE | 96.39 | WAL-MART #0137 | 2013-07-30 | 2013-07-31 | GROCERY STORES,AND SUPERMARKETS | 2013 | 7 | 31 | Tuesday |
| 4 | 2013-07-01 | 1000 | OKLAHOMA STATE UNIVERSITY | Mauro-Herrera | M | HAMMERMILL COPY PLUS COPY EA | 125.96 | STAPLES DIRECT | 2013-07-30 | 2013-07-31 | STATIONERY, OFFICE SUPPLIES, PRINTING AND WRIT... | 2013 | 7 | 31 | Tuesday |
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442458 entries, 0 to 442457
Data columns (total 15 columns):
 #   Column                    Non-Null Count   Dtype
---  ------                    --------------   -----
 0   Year_Month                442458 non-null  datetime64[ns]
 1   Agency_Number             442458 non-null  int64
 2   Agency_Name               442458 non-null  object
 3   Cardholder_Last_Name      442458 non-null  object
 4   Cardholder_First_Initial  442458 non-null  object
 5   Description               442458 non-null  object
 6   Amount                    442458 non-null  float64
 7   Vendor                    442458 non-null  object
 8   Transaction_Date          442458 non-null  datetime64[ns]
 9   Posted_Date               442458 non-null  datetime64[ns]
 10  Merchant_Category         442458 non-null  object
 11  Year                      442458 non-null  int32
 12  Month                     442458 non-null  int32
 13  Week_Number               442458 non-null  UInt32
 14  Day_of_Week               442458 non-null  object
dtypes: UInt32(1), datetime64[ns](3), float64(1), int32(2), int64(1), object(7)
memory usage: 46.0+ MB
Section 2: EDA ¶
# Summarize the transaction counts by Agency_Name in df_count
df_count = df['Agency_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Agency_Name','Count']
# Plot agencies ranked 2-20 by transaction count (the largest is omitted for scale)
sns.set_theme(style="whitegrid")
sns.barplot(y="Agency_Name", x="Count", data=df_count[1:20])
plt.title('Distribution of Agency', fontweight='bold')
plt.xlabel('Count')
plt.ylabel('Agency Name')
plt.show()
Distribution of Transaction Amount¶
# Set the style of the seaborn plot
sns.set_style("whitegrid")
# Create a histogram of transaction amounts on a log scale
# (note: negative amounts, e.g. refunds, cannot be shown on a log axis)
plt.figure(figsize=(10, 6))
ax = sns.histplot(data=df['Amount'], bins=100, kde=True, log_scale=True)
# Format x-axis tick labels to add comma formatting
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: '${:,.0f}'.format(x) if x >= 1 else '${:.2f}'.format(x)))
plt.title('Distribution of Transaction Amounts', fontweight='bold')
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.savefig('Distribution of Transaction Amount.png', transparent=True)
# Display the plot
plt.show()
Interaction of Transaction Amounts by Agency and Merchant Category¶
# Calculate the total transaction amount for each Agency and Merchant Category
df_grouped = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].sum().reset_index()
# Keep the top 40 agency-merchant pairs for a clearer visualization
df_grouped = df_grouped.sort_values(by='Amount', ascending=False)
df_grouped = df_grouped.head(40)
# Sort DataFrame by Amount
df_grouped = df_grouped.sort_values(by='Amount', ascending=True)
# Create the scatter plot
fig = px.scatter(df_grouped,
x="Agency_Name",
y="Merchant_Category",
color="Amount",
size="Amount",
title='Interaction of Transaction Amounts by Agency and Merchant Category',
width=1000, height=600,
color_continuous_scale='Viridis_r'
)
# Customize layout to edit legend title and set background to white
fig.update_layout(
coloraxis_colorbar=dict(
title="Amount ($)" # Legend title
),
title={
'text': 'Interaction of Transaction Amounts by Agency and Merchant Category',
'y':0.9,
'x':0.5,
'xanchor': 'center',
'yanchor': 'top'
},
paper_bgcolor="rgba(0,0,0,0)", # Set the background color of the paper (outer) to transparent
plot_bgcolor='white' # Set the background color of the plotting area (inner) to white
)
# Customize x-axis
fig.update_xaxes(
title = 'Agency Name',
mirror=True,
ticks='outside',
showline=True,
linecolor='lightgrey',
gridcolor='lightgrey',
showgrid=False,
tickangle=-30
)
# Customize y-axis
fig.update_yaxes(
title = 'Merchant Category',
mirror=True,
ticks='outside',
showline=True,
linecolor='lightgrey',
gridcolor='lightgrey',
showgrid=False
)
fig.write_image("Interactions of two or three variables.png")
# Show the plot
fig.show()
Section 3: Feature Engineering ¶
- Avg amount by agency, merchant
- Avg amount by agency, merchant in last 30 days
- Avg amount by agency, merchant in last 60 days
- Avg amount by agency, merchant in last 90 days
- Max amount by agency, merchant in last 30 days
- Max amount by agency, merchant in last 60 days
- Max amount by agency, merchant in last 90 days
- Avg transaction by agency, merchant
- Avg transaction by agency, merchant in last 30 days
- Avg transaction by agency, merchant in last 60 days
- Avg transaction by agency, merchant in last 90 days
- Avg amount by agency, vendor
- Avg amount by agency, vendor in last 30 days
- Avg amount by agency, vendor in last 60 days
- Avg amount by agency, vendor in last 90 days
The features fall into four main categories: average amount by agency and merchant, maximum amount by agency and merchant, average transaction count by agency and merchant, and average amount by agency and vendor. Within each category, we compute benchmarks over the full transaction history and over trailing windows (the last 30, 60, and 90 days) to identify spending patterns and potential anomalies.
3.1.1 Avg Amount by Agency, Merchant Ratio¶
# Define helper functions for the benchmark calculations
def calculate_avg_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].mean().reset_index(name='Avg_Amount')

def calculate_max_amount(data, group_by_cols):
    return data.groupby(group_by_cols)['Amount'].max().reset_index(name='Max_Amount')

def calculate_avg_transaction(data, group_by_cols):
    # Note: this is the transaction count per group, used as the activity benchmark
    return data.groupby(group_by_cols).size().reset_index(name='Avg_Transaction_Count')

def calculate_last_n_days(data, n):
    # Keep rows strictly within n days of the most recent transaction date
    cutoff_date = data['Transaction_Date'].max() - pd.Timedelta(days=n)
    return data[data['Transaction_Date'] > cutoff_date]
# Define the group by columns
agency_merchant_cols = ['Agency_Name', 'Merchant_Category']
agency_vendor_cols = ['Agency_Name', 'Vendor']
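One subtlety worth checking: `calculate_last_n_days` is anchored to the latest transaction date in the data and uses a strict `>` comparison. A quick sanity check on a toy frame (hypothetical dates and amounts; the helper is reproduced so the snippet runs on its own):

```python
import pandas as pd

# Reproduced from the cell above so this snippet stands alone
def calculate_last_n_days(data, n):
    cutoff_date = data['Transaction_Date'].max() - pd.Timedelta(days=n)
    return data[data['Transaction_Date'] > cutoff_date]

# Hypothetical three-row frame spanning four months
toy = pd.DataFrame({
    'Transaction_Date': pd.to_datetime(['2014-03-01', '2014-05-15', '2014-06-30']),
    'Amount': [100.0, 200.0, 300.0],
})
recent = calculate_last_n_days(toy, 30)
# The cutoff is 30 days before the latest date (2014-06-30), so only that row survives
print(recent['Amount'].tolist())  # [300.0]
```

Widening the window to 60 days pulls in the 2014-05-15 row as well.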
table_1 = df.groupby(['Agency_Name', 'Merchant_Category'])['Amount'].mean().reset_index()
table_1 = table_1.rename(columns={'Amount': 'Average_Merchant_amount'})
table_1.head()
|   | Agency_Name | Merchant_Category | Average_Merchant_amount |
|---|---|---|---|
| 0 | ARDMORE HIGHER EDUCATION CENTER | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 115.120000 |
| 1 | ARDMORE HIGHER EDUCATION CENTER | CATALOG MERCHANTS | -53.344286 |
| 2 | ARDMORE HIGHER EDUCATION CENTER | COMPUTER NETWORK/INFORMATION SERVICES | 18.340000 |
| 3 | ARDMORE HIGHER EDUCATION CENTER | DRUG STORES AND PHARMACIES | 153.450000 |
| 4 | ARDMORE HIGHER EDUCATION CENTER | GROCERY STORES,AND SUPERMARKETS | 74.525455 |
def calculate_descriptive_stats(data, column):
    '''
    Calculate descriptive statistics for a column, with additional 90th and 95th percentiles.
    '''
    # Basic descriptive statistics
    desc_stats = data[column].describe()
    # 90th and 95th percentiles
    perc_90 = data[column].quantile(0.90)
    perc_95 = data[column].quantile(0.95)
    # Convert the descriptive statistics to a one-row DataFrame
    desc_stats = desc_stats.to_frame().transpose()
    # Add the 90th and 95th percentiles
    desc_stats['90%'] = perc_90
    desc_stats['95%'] = perc_95
    # Reorder the columns to the desired format
    desc_stats = desc_stats[['mean', 'min', '25%', '50%', '75%', '90%', '95%', 'max']]
    # Transpose back to one statistic per row
    desc_stats = desc_stats.T
    return desc_stats
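The extra percentiles come straight from `Series.quantile`, which by default interpolates linearly between order statistics. On a toy column of 1 through 10 (hypothetical data), the values it returns are:

```python
import pandas as pd

toy = pd.DataFrame({'Amount': range(1, 11)})  # 1..10
p50 = toy['Amount'].quantile(0.50)
p90 = toy['Amount'].quantile(0.90)
# Linear interpolation: the 90th percentile sits 0.9 of the way through the 9 gaps,
# so it lands between 9 and 10 rather than on an observed value
print(p50, p90)
```

This matters for skewed amount distributions: the 90% and 95% benchmarks are interpolated, not simply the largest observed amounts.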
table_1_stats = calculate_descriptive_stats(table_1, 'Average_Merchant_amount')
table_1_stats
|   | Average_Merchant_amount |
|---|---|
| mean | 526.650020 |
| min | -3047.263333 |
| 25% | 111.995000 |
| 50% | 269.450000 |
| 75% | 527.840500 |
| 90% | 1007.506000 |
| 95% | 1574.364300 |
| max | 171619.610000 |
table_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8427 entries, 0 to 8426
Data columns (total 3 columns):
 #   Column                   Non-Null Count  Dtype
---  ------                   --------------  -----
 0   Agency_Name              8427 non-null   object
 1   Merchant_Category        8427 non-null   object
 2   Average_Merchant_amount  8427 non-null   float64
dtypes: float64(1), object(2)
memory usage: 197.6+ KB
# Merge data back to the dataset
df1 = pd.merge(df, table_1, how='left', on=['Agency_Name', 'Merchant_Category'])
# Calculate ratio
df1['Average_Merchant_amount_Ratio'] = np.where(df1['Average_Merchant_amount']==0,0, df1['Amount']/df1['Average_Merchant_amount'])
df1['Average_Merchant_amount_Ratio'].describe()
count    442458.000000
mean          0.999724
std           7.225363
min       -2735.680000
25%           0.177145
50%           0.515301
75%           1.102916
max        2735.680000
Name: Average_Merchant_amount_Ratio, dtype: float64
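A caveat on this ratio pattern: `np.where(benchmark == 0, 0, amount / benchmark)` guards only an exactly-zero denominator, so a near-zero group average still produces an extreme ratio (visible later in the 60- and 90-day variants). One alternative, sketched here with hypothetical values and a hypothetical `floor` threshold rather than the assignment's method, is to also floor the denominator:

```python
import numpy as np
import pandas as pd

# Hypothetical amounts and group-average benchmarks (one zero, one near-zero)
amounts = pd.Series([100.0, 250.0, 5.0])
benchmark = pd.Series([50.0, 0.0, 0.001])

# The notebook's pattern: guard only the exactly-zero benchmark
ratio = np.where(benchmark == 0, 0, amounts / benchmark)

# Floored variant: also zero out benchmarks too small to be meaningful
floor = 1.0  # hypothetical minimum meaningful average, in dollars
ratio_floored = np.where(benchmark.abs() < floor, 0, amounts / benchmark)

print(ratio)          # the near-zero benchmark still yields a huge ratio (~5000)
print(ratio_floored)  # the floored variant suppresses it
```

Whether to floor depends on whether tiny averages are noise or genuinely informative, so the notebook keeps the simple zero guard.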
# Define variables and quantile range
var = 'Average_Merchant_amount_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(
title='Distribution of Average Merchant Amount Ratio',
xaxis_title='Quantile Range',
yaxis_title='Count',
template = 'simple_white'
)
# Show the plot
fig.show()
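This binning pattern recurs for every ratio feature below; it relies on `pd.qcut` accepting an explicit list of quantile edges rather than a bin count. On a toy column of 0-99 (hypothetical data), a coarser edge list behaves like:

```python
import pandas as pd

vals = pd.Series(range(100))  # 0..99
percentile = [0, 0.05, 0.5, 0.95, 1]  # explicit quantile edges, as in the cells above
bins = pd.qcut(vals, percentile)
counts = bins.value_counts().sort_index().tolist()
# Each bin holds the expected share of the 100 values: 5%, 45%, 45%, 5%
print(counts)  # [5, 45, 45, 5]
```

Because the edges are quantiles of the ratio itself, the tail bins (0-1% and 99-100%) isolate the most extreme ratios regardless of their absolute scale.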
3.1.2 Avg Amount by Agency, Merchant Ratio in last 30 days ¶
# Feature 2: Average Amount by Agency, Merchant in last 30 days
table_2 = calculate_avg_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
table_2 = table_2.rename(columns={'Avg_Amount': 'Average_Merchant_amount_30days'})
table_2.head()
|   | Agency_Name | Merchant_Category | Average_Merchant_amount_30days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.000000 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 462.781429 |
| 2 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 436.428571 |
| 3 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 24.590000 |
| 4 | ATTORNEY GENERAL | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 875.000000 |
table_2.isnull().sum()
Agency_Name                       0
Merchant_Category                 0
Average_Merchant_amount_30days    0
dtype: int64
table_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3372 entries, 0 to 3371
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype
---  ------                          --------------  -----
 0   Agency_Name                     3372 non-null   object
 1   Merchant_Category               3372 non-null   object
 2   Average_Merchant_amount_30days  3372 non-null   float64
dtypes: float64(1), object(2)
memory usage: 79.2+ KB
table_2_stats = calculate_descriptive_stats(table_2, 'Average_Merchant_amount_30days')
table_2_stats
|   | Average_Merchant_amount_30days |
|---|---|
| mean | 529.500247 |
| min | -2393.700000 |
| 25% | 86.884113 |
| 50% | 240.048810 |
| 75% | 556.442353 |
| 90% | 1089.924333 |
| 95% | 1886.608333 |
| max | 45784.640000 |
# Add benchmarks back to original df
df1 = pd.merge(df1, table_2, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Average_Merchant_amount_30days'] = df1['Average_Merchant_amount_30days'].fillna(0)  # fill groups absent from the window
# Calculate the ratio
df1['Average_Merchant_amount_30days_Ratio'] = np.where(df1['Average_Merchant_amount_30days']==0,0, df1['Amount']/df1['Average_Merchant_amount_30days'])
df1['Average_Merchant_amount_30days_Ratio'].describe()
count    442458.000000
mean          2.609376
std         630.043777
min       -6533.448873
25%           0.120986
50%           0.427965
75%           1.037891
max      308210.189408
Name: Average_Merchant_amount_30days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Merchant_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.3 Avg Amount by Agency, Merchant Ratio in last 60 days ¶
# Feature 3: Average Amount by Agency, Merchant in last 60 days
table_3 = calculate_avg_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
table_3 = table_3.rename(columns={'Avg_Amount': 'Average_Merchant_amount_60days'})
table_3.head()
|   | Agency_Name | Merchant_Category | Average_Merchant_amount_60days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.000000 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 465.248125 |
| 2 | ATTORNEY GENERAL | BOOK STORES | 123.225000 |
| 3 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 332.034545 |
| 4 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 118.990000 |
table_3_stats = calculate_descriptive_stats(table_3, 'Average_Merchant_amount_60days')
table_3_stats
|   | Average_Merchant_amount_60days |
|---|---|
| mean | 542.288064 |
| min | -995.600000 |
| 25% | 96.037500 |
| 50% | 265.452500 |
| 75% | 551.635076 |
| 90% | 1079.433628 |
| 95% | 1742.393225 |
| max | 134734.050000 |
# Add benchmarks back to original df
df1 = pd.merge(df1, table_3, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Average_Merchant_amount_60days'] = df1['Average_Merchant_amount_60days'].fillna(0)
# Calculate the ratio
df1['Average_Merchant_amount_60days_Ratio'] = np.where(df1['Average_Merchant_amount_60days']==0,0, df1['Amount']/df1['Average_Merchant_amount_60days'] )
df1['Average_Merchant_amount_60days_Ratio'].describe()
count    4.424580e+05
mean     7.628855e+11
std      4.358540e+15
min     -8.104228e+17
25%      1.399177e-01
50%      4.544632e-01
75%      1.046130e+00
max      8.104228e+17
Name: Average_Merchant_amount_60days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Merchant_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.4 Avg Amount by Agency, Merchant Ratio in last 90 days ¶
# Feature 4: Average Amount by Agency, Merchant in last 90 days
table_4 = calculate_avg_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
table_4 = table_4.rename(columns={'Avg_Amount': 'Average_Merchant_amount_90days'})
table_4.head()
|   | Agency_Name | Merchant_Category | Average_Merchant_amount_90days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.000000 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 417.765667 |
| 2 | ATTORNEY GENERAL | AUTOMOTIVE PARTS AND ACCESSORIES STORES | 97.215000 |
| 3 | ATTORNEY GENERAL | BOOK STORES | 123.225000 |
| 4 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 299.336923 |
table_4_stats = calculate_descriptive_stats(table_4, 'Average_Merchant_amount_90days')
table_4_stats
|   | Average_Merchant_amount_90days |
|---|---|
| mean | 538.989403 |
| min | -842.720000 |
| 25% | 102.350694 |
| 50% | 270.000000 |
| 75% | 555.795938 |
| 90% | 1061.350727 |
| 95% | 1641.144750 |
| max | 214206.120000 |
# Merge dfs
df1 = pd.merge(df1, table_4, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Average_Merchant_amount_90days'] = df1['Average_Merchant_amount_90days'].fillna(0)
# Calculate the ratio
df1['Average_Merchant_amount_90days_Ratio'] = np.where(df1['Average_Merchant_amount_90days']==0,0, df1['Amount']/df1['Average_Merchant_amount_90days'])
df1['Average_Merchant_amount_90days_Ratio'].describe()
count    4.424580e+05
mean     7.628855e+11
std      4.358540e+15
min     -8.104228e+17
25%      1.480874e-01
50%      4.679337e-01
75%      1.067796e+00
max      8.104228e+17
Name: Average_Merchant_amount_90days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Merchant_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Merchant Amount in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.5 Max Amount by Agency, Merchant Ratio in last 30 days ¶
# Feature 5: Max Amount by Agency, Merchant in last 30 days
table_5 = calculate_max_amount(calculate_last_n_days(df, 30), agency_merchant_cols)
table_5 = table_5.rename(columns={'Max_Amount': 'Max_Merchant_amount_30days'})
table_5.head()
|   | Agency_Name | Merchant_Category | Max_Merchant_amount_30days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.00 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 982.00 |
| 2 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 1000.00 |
| 3 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 24.59 |
| 4 | ATTORNEY GENERAL | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 1325.00 |
table_5_stats = calculate_descriptive_stats(table_5, 'Max_Merchant_amount_30days')
table_5_stats
|   | Max_Merchant_amount_30days |
|---|---|
| mean | 1552.831412 |
| min | -2393.700000 |
| 25% | 118.747500 |
| 50% | 425.155000 |
| 75% | 1188.075000 |
| 90% | 3002.620000 |
| 95% | 4783.725000 |
| max | 132790.140000 |
# Merge dataset
df1 = pd.merge(df1, table_5, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Max_Merchant_amount_30days'] = df1['Max_Merchant_amount_30days'].fillna(0)
# Calculate the ratio
df1['Max_Merchant_amount_30days_Ratio'] = np.where(df1['Max_Merchant_amount_30days']==0,0, df1['Amount']/df1['Max_Merchant_amount_30days'])
df1['Max_Merchant_amount_30days_Ratio'].describe()
count    442458.000000
mean          0.694637
std         145.885707
min       -6533.448873
25%           0.010224
50%           0.047619
75%           0.193889
max       70907.201862
Name: Max_Merchant_amount_30days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Max_Merchant_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.6 Max Amount by Agency, Merchant Ratio in last 60 days ¶
# Feature 6: Max Amount by Agency, Merchant in last 60 days
table_6 = calculate_max_amount(calculate_last_n_days(df, 60), agency_merchant_cols)
table_6 = table_6.rename(columns={'Max_Amount': 'Max_Merchant_amount_60days'})
table_6.head()
|   | Agency_Name | Merchant_Category | Max_Merchant_amount_60days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.00 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 982.00 |
| 2 | ATTORNEY GENERAL | BOOK STORES | 208.50 |
| 3 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 1000.00 |
| 4 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 306.66 |
table_6_stats = calculate_descriptive_stats(table_6, 'Max_Merchant_amount_60days')
table_6_stats
|   | Max_Merchant_amount_60days |
|---|---|
| mean | 1794.498448 |
| min | -995.600000 |
| 25% | 154.360000 |
| 50% | 488.925000 |
| 75% | 1340.890000 |
| 90% | 3515.500000 |
| 95% | 4919.054500 |
| max | 348053.750000 |
# Merge dataset
df1 = pd.merge(df1, table_6, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Max_Merchant_amount_60days'] = df1['Max_Merchant_amount_60days'].fillna(0)
# Calculate the ratio
df1['Max_Merchant_amount_60days_Ratio'] = np.where(df1['Max_Merchant_amount_60days']==0,0, df1['Amount']/df1['Max_Merchant_amount_60days'])
df1['Max_Merchant_amount_60days_Ratio'].describe()
count    442458.000000
mean          0.610762
std         145.128071
min        -853.071672
25%           0.009709
50%           0.040093
75%           0.157241
max       70907.201862
Name: Max_Merchant_amount_60days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Max_Merchant_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.7 Max Amount by Agency, Merchant Ratio in last 90 days ¶
# Feature 7: Max Amount by Agency, Merchant in last 90 days
table_7 = calculate_max_amount(calculate_last_n_days(df, 90), agency_merchant_cols)
table_7 = table_7.rename(columns={'Max_Amount': 'Max_Merchant_amount_90days'})
table_7.head()
|   | Agency_Name | Merchant_Category | Max_Merchant_amount_90days |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1040.00 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 982.00 |
| 2 | ATTORNEY GENERAL | AUTOMOTIVE PARTS AND ACCESSORIES STORES | 186.17 |
| 3 | ATTORNEY GENERAL | BOOK STORES | 208.50 |
| 4 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 1000.00 |
table_7_stats = calculate_descriptive_stats(table_7, 'Max_Merchant_amount_90days')
table_7_stats
|   | Max_Merchant_amount_90days |
|---|---|
| mean | 1906.054049 |
| min | -842.720000 |
| 25% | 166.000000 |
| 50% | 513.520000 |
| 75% | 1448.627500 |
| 90% | 3634.704000 |
| 95% | 4982.698500 |
| max | 373150.260000 |
# Merge dataset
df1 = pd.merge(df1, table_7, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Max_Merchant_amount_90days'] = df1['Max_Merchant_amount_90days'].fillna(0)
# Calculate the ratio
df1['Max_Merchant_amount_90days_Ratio'] = np.where(df1['Max_Merchant_amount_90days']==0,0, df1['Amount']/df1['Max_Merchant_amount_90days'])
df1['Max_Merchant_amount_90days_Ratio'].describe()
count    442458.000000
mean          0.533116
std         144.864992
min        -464.136871
25%           0.009191
50%           0.035351
75%           0.138708
max       70907.201862
Name: Max_Merchant_amount_90days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Max_Merchant_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile)
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Max Merchant Amount in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.8 Avg Transaction by Agency, Merchant Ratio ¶
# Feature 8: Average Transaction by Agency, Merchant
table_8 = calculate_avg_transaction(df, agency_merchant_cols)
table_8 = table_8.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction'})
table_8.head()
| Agency_Name | Merchant_Category | Avg_Transaction | |
|---|---|---|---|
| 0 | ARDMORE HIGHER EDUCATION CENTER | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 1 |
| 1 | ARDMORE HIGHER EDUCATION CENTER | CATALOG MERCHANTS | 7 |
| 2 | ARDMORE HIGHER EDUCATION CENTER | COMPUTER NETWORK/INFORMATION SERVICES | 1 |
| 3 | ARDMORE HIGHER EDUCATION CENTER | DRUG STORES AND PHARMACIES | 1 |
| 4 | ARDMORE HIGHER EDUCATION CENTER | GROCERY STORES,AND SUPERMARKETS | 11 |
table_8_stats = calculate_descriptive_stats(table_8, 'Avg_Transaction')
table_8_stats
| Avg_Transaction | |
|---|---|
| mean | 52.504806 |
| min | 1.000000 |
| 25% | 2.000000 |
| 50% | 5.000000 |
| 75% | 17.000000 |
| 90% | 67.000000 |
| 95% | 168.700000 |
| max | 8102.000000 |
baseline_transaction = table_8['Avg_Transaction'].mean()
# Merge the average transaction count by Agency and Merchant into the dataset
df1 = pd.merge(df1, table_8, how='left', on=['Agency_Name', 'Merchant_Category'])
# Calculate the ratio of the overall baseline to the group's average transaction count
df1['Avg_Transaction_Ratio'] = np.where(df1['Avg_Transaction']==0,0, baseline_transaction/df1['Avg_Transaction'])
df1['Avg_Transaction_Ratio'].describe()
count 442458.000000 mean 1.000000 std 4.235557 min 0.006480 25% 0.020960 50% 0.070666 75% 0.352381 max 52.504806 Name: Avg_Transaction_Ratio, dtype: float64
# Define variables and quantile range
var = 'Avg_Transaction_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.9 Avg Transaction by Agency, Merchant Ratio in last 30 days ¶
table_9=calculate_avg_transaction(calculate_last_n_days(df, 30), agency_merchant_cols)
table_9 = table_9.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_30days'})
table_9.head()
| Agency_Name | Merchant_Category | Avg_Transaction_30days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 7 |
| 2 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 7 |
| 3 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 1 |
| 4 | ATTORNEY GENERAL | CHARITABLE AND SOCIAL SERVICE ORGANIZATIONS | 3 |
table_9_stats = calculate_descriptive_stats(table_9, 'Avg_Transaction_30days')
table_9_stats
| Avg_Transaction_30days | |
|---|---|
| mean | 10.469158 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 2.000000 |
| 75% | 6.000000 |
| 90% | 19.000000 |
| 95% | 38.000000 |
| max | 643.000000 |
# Merge the 30-day average transaction count by Agency and Merchant into the dataset
df1 = pd.merge(df1, table_9, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Avg_Transaction_30days'].fillna(0, inplace=True)
# Calculate the ratio of the overall baseline to the group's 30-day average transaction count
df1['Avg_Transaction_30days_Ratio'] = np.where(df1['Avg_Transaction_30days']==0,0, baseline_transaction/df1['Avg_Transaction_30days'])
df1['Avg_Transaction_30days_Ratio'].describe()
count 442458.000000 mean 4.430460 std 10.344312 min 0.000000 25% 0.195913 50% 0.625057 75% 2.763411 max 52.504806 Name: Avg_Transaction_30days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Avg_Transaction_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.10 Avg Transaction by Agency, Merchant Ratio in last 60 days ¶
table_10=calculate_avg_transaction(calculate_last_n_days(df, 60), agency_merchant_cols)
table_10 = table_10.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_60days'})
table_10.head()
| Agency_Name | Merchant_Category | Avg_Transaction_60days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 16 |
| 2 | ATTORNEY GENERAL | BOOK STORES | 2 |
| 3 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 11 |
| 4 | ATTORNEY GENERAL | CABLE, SATELLITE, AND OTHER PAY TELEVISION AND... | 3 |
table_10_stats = calculate_descriptive_stats(table_10, 'Avg_Transaction_60days')
table_10_stats
| Avg_Transaction_60days | |
|---|---|
| mean | 15.985294 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 3.000000 |
| 75% | 8.000000 |
| 90% | 26.100000 |
| 95% | 56.000000 |
| max | 1302.000000 |
# Merge the 60-day average transaction count by Agency and Merchant into the dataset
df1 = pd.merge(df1, table_10, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Avg_Transaction_60days'].fillna(0, inplace=True)
# Calculate the ratio of the overall baseline to the group's 60-day average transaction count
df1['Avg_Transaction_60days_Ratio'] = np.where(df1['Avg_Transaction_60days']==0,0, baseline_transaction/ df1['Avg_Transaction_60days'])
df1['Avg_Transaction_60days_Ratio'].describe()
count 442458.000000 mean 3.022885 std 8.245084 min 0.000000 25% 0.109613 50% 0.362102 75% 1.640775 max 52.504806 Name: Avg_Transaction_60days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Avg_Transaction_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.11 Avg Transaction by Agency, Merchant Ratio in last 90 days ¶
table_11=calculate_avg_transaction(calculate_last_n_days(df, 90), agency_merchant_cols)
table_11 = table_11.rename(columns={'Avg_Transaction_Count': 'Avg_Transaction_90days'})
table_11.head()
| Agency_Name | Merchant_Category | Avg_Transaction_90days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADVERTISING SERVICES | 1 |
| 1 | ATTORNEY GENERAL | AMERICAN AIRLINES | 30 |
| 2 | ATTORNEY GENERAL | AUTOMOTIVE PARTS AND ACCESSORIES STORES | 2 |
| 3 | ATTORNEY GENERAL | BOOK STORES | 2 |
| 4 | ATTORNEY GENERAL | BUSINESS SERVICES NOT ELSEWHERE CLASSIFIED | 13 |
table_11_stats = calculate_descriptive_stats(table_11, 'Avg_Transaction_90days')
table_11_stats
| Avg_Transaction_90days | |
|---|---|
| mean | 21.052301 |
| min | 1.000000 |
| 25% | 1.000000 |
| 50% | 3.000000 |
| 75% | 10.000000 |
| 90% | 33.000000 |
| 95% | 74.150000 |
| max | 2146.000000 |
# Merge the 90-day average transaction count by Agency and Merchant into the dataset
df1 = pd.merge(df1, table_11, how='left', on=['Agency_Name', 'Merchant_Category'])
df1['Avg_Transaction_90days'].fillna(0, inplace=True)
# Calculate the ratio of the overall baseline to the group's 90-day average transaction count
df1['Avg_Transaction_90days_Ratio'] = np.where(df1['Avg_Transaction_90days']==0,0, baseline_transaction/df1['Avg_Transaction_90days'])
df1['Avg_Transaction_90days_Ratio'].describe()
count 442458.000000 mean 2.336722 std 7.166343 min 0.000000 25% 0.073536 50% 0.245350 75% 1.166773 max 52.504806 Name: Avg_Transaction_90days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Avg_Transaction_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Transaction Count in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.12 Avg Amount by Agency, Vendor Ratio ¶
table_12 = calculate_avg_amount(df, agency_vendor_cols)
table_12 = table_12.rename(columns={'Avg_Amount': 'Average_Vendor_amount'})
table_12.head()
| Agency_Name | Vendor | Average_Vendor_amount | |
|---|---|---|---|
| 0 | ARDMORE HIGHER EDUCATION CENTER | AGRI PRODUCTS | 92.500000 |
| 1 | ARDMORE HIGHER EDUCATION CENTER | CDW GOVERNMENT | -53.344286 |
| 2 | ARDMORE HIGHER EDUCATION CENTER | COUNTY BUILDING CE | 58.440000 |
| 3 | ARDMORE HIGHER EDUCATION CENTER | EMPIRE PAPER COMPANY | 443.540000 |
| 4 | ARDMORE HIGHER EDUCATION CENTER | GODADDY.COM | 18.340000 |
table_12_stats = calculate_descriptive_stats(table_12, 'Average_Vendor_amount')
table_12_stats
| Average_Vendor_amount | |
|---|---|
| mean | 4.471355e+02 |
| min | -4.907220e+03 |
| 25% | 2.900000e+01 |
| 50% | 1.561667e+02 |
| 75% | 4.560000e+02 |
| 90% | 8.486420e+02 |
| 95% | 1.432209e+03 |
| max | 1.827119e+06 |
# Merge the average amount by Agency and Vendor into the dataset
df1 = pd.merge(df1, table_12, how='left', on=['Agency_Name', 'Vendor'])
# Calculate the ratio of the transaction amount to the group's average vendor amount
df1['Average_Vendor_amount_Ratio'] = np.where(df1['Average_Vendor_amount']==0,0, df1['Amount']/df1['Average_Vendor_amount'])
df1['Average_Vendor_amount_Ratio'].describe()
count 4.424580e+05 mean 9.931041e-01 std 1.963144e+14 min -6.944551e+16 25% 3.399096e-01 50% 9.543105e-01 75% 1.040700e+00 max 6.424385e+16 Name: Average_Vendor_amount_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Vendor_amount_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Vendor Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.13 Avg Amount by Agency, Vendor Ratio in last 30 days ¶
table_13=calculate_avg_amount(calculate_last_n_days(df, 30), agency_vendor_cols)
table_13 = table_13.rename(columns={'Avg_Amount': 'Average_Vendor_amount_30days'})
table_13.head()
| Agency_Name | Vendor | Average_Vendor_amount_30days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADI ASPEN PUBLISHERS | 866.20 |
| 1 | ATTORNEY GENERAL | AMERICAN AI 0017450939750 | -250.03 |
| 2 | ATTORNEY GENERAL | AMERICAN AI 0017456015142 | 982.00 |
| 3 | ATTORNEY GENERAL | AMERICAN AI 0017459356062 | 669.00 |
| 4 | ATTORNEY GENERAL | AMERICAN AI 0017459356079 | 687.50 |
table_13_stats = calculate_descriptive_stats(table_13, 'Average_Vendor_amount_30days')
table_13_stats
| Average_Vendor_amount_30days | |
|---|---|
| mean | 507.333145 |
| min | -4987.040000 |
| 25% | 40.680000 |
| 50% | 162.802778 |
| 75% | 474.000000 |
| 90% | 1031.480000 |
| 95% | 1822.994000 |
| max | 60347.380000 |
# Merge the 30-day average vendor amount by Agency and Vendor into the dataset
df1 = pd.merge(df1, table_13, how='left', on=['Agency_Name', 'Vendor'])
df1['Average_Vendor_amount_30days'].fillna(0, inplace=True)
# Calculate the ratio of the transaction amount to the group's 30-day average vendor amount
df1['Average_Vendor_amount_30days_Ratio'] = np.where(df1['Average_Vendor_amount_30days']==0,0, df1['Amount']/ df1['Average_Vendor_amount_30days'])
df1['Average_Vendor_amount_30days_Ratio'].describe()
count 4.424580e+05 mean 8.161432e-01 std 1.424394e+14 min -6.944551e+16 25% 0.000000e+00 50% 1.390245e-01 75% 8.809942e-01 max 6.424385e+16 Name: Average_Vendor_amount_30days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Vendor_amount_30days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 30 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.14 Avg Amount by Agency, Vendor Ratio in last 60 days ¶
table_14 =calculate_avg_amount(calculate_last_n_days(df, 60), agency_vendor_cols)
table_14 = table_14.rename(columns={'Avg_Amount': 'Average_Vendor_amount_60days'})
table_14.head()
| Agency_Name | Vendor | Average_Vendor_amount_60days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADI ASPEN PUBLISHERS | 866.200 |
| 1 | ATTORNEY GENERAL | AMAZON MKTPLACE PMTS | 123.225 |
| 2 | ATTORNEY GENERAL | AMERICAN AI 0017450939750 | 205.985 |
| 3 | ATTORNEY GENERAL | AMERICAN AI 0017450939785 | -361.000 |
| 4 | ATTORNEY GENERAL | AMERICAN AI 0017450939901 | 250.000 |
table_14_stats = calculate_descriptive_stats(table_14, 'Average_Vendor_amount_60days')
table_14_stats
| Average_Vendor_amount_60days | |
|---|---|
| mean | 492.806725 |
| min | -4999.000000 |
| 25% | 37.000000 |
| 50% | 162.000000 |
| 75% | 475.000000 |
| 90% | 999.735000 |
| 95% | 1734.096500 |
| max | 348053.750000 |
# Merge the 60-day average vendor amount by Agency and Vendor into the dataset
df1 = pd.merge(df1, table_14, how='left', on=['Agency_Name', 'Vendor'])
df1['Average_Vendor_amount_60days'].fillna(0, inplace=True)
# Calculate the ratio of the transaction amount to the group's 60-day average vendor amount
df1['Average_Vendor_amount_60days_Ratio'] = np.where(df1['Average_Vendor_amount_60days']==0,0, df1['Amount']/df1['Average_Vendor_amount_60days'])
df1['Average_Vendor_amount_60days_Ratio'].describe()
count 4.424580e+05 mean 8.131557e-01 std 1.424394e+14 min -6.944551e+16 25% 0.000000e+00 50% 2.655243e-01 75% 1.000000e+00 max 6.424385e+16 Name: Average_Vendor_amount_60days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Vendor_amount_60days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 60 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
3.1.15 Avg Amount by Agency, Vendor Ratio in last 90 days ¶
table_15 =calculate_avg_amount(calculate_last_n_days(df, 90), agency_vendor_cols)
table_15 = table_15.rename(columns={'Avg_Amount': 'Average_Vendor_amount_90days'})
table_15.head()
| Agency_Name | Vendor | Average_Vendor_amount_90days | |
|---|---|---|---|
| 0 | ATTORNEY GENERAL | ADI ASPEN PUBLISHERS | 866.200 |
| 1 | ATTORNEY GENERAL | AMAZON MKTPLACE PMTS | 123.225 |
| 2 | ATTORNEY GENERAL | AMERICAN AI 0017394725288 | 444.000 |
| 3 | ATTORNEY GENERAL | AMERICAN AI 0017395886798 | 673.000 |
| 4 | ATTORNEY GENERAL | AMERICAN AI 0017395886825 | 0.000 |
table_15_stats = calculate_descriptive_stats(table_15, 'Average_Vendor_amount_90days')
table_15_stats
| Average_Vendor_amount_90days | |
|---|---|
| mean | 467.539420 |
| min | -3889.570000 |
| 25% | 36.185000 |
| 50% | 164.430000 |
| 75% | 478.858333 |
| 90% | 975.000000 |
| 95% | 1669.614000 |
| max | 298416.860000 |
# Merge the 90-day average vendor amount by Agency and Vendor into the dataset
df1 = pd.merge(df1, table_15, how='left', on=['Agency_Name', 'Vendor'])
df1['Average_Vendor_amount_90days'].fillna(0, inplace=True)
# Calculate the ratio of the transaction amount to the group's 90-day average vendor amount
df1['Average_Vendor_amount_90days_Ratio'] = np.where(df1['Average_Vendor_amount_90days']==0,0, df1['Amount']/df1['Average_Vendor_amount_90days'])
df1['Average_Vendor_amount_90days_Ratio'].describe()
count 4.424580e+05 mean 8.885308e-01 std 1.424394e+14 min -6.944551e+16 25% 0.000000e+00 50% 3.731346e-01 75% 1.000000e+00 max 6.424385e+16 Name: Average_Vendor_amount_90days_Ratio, dtype: float64
# Define variables and quantile range
var = 'Average_Vendor_amount_90days_Ratio'
binned_var = var + '_bin'
percentile = [0,0.01,0.05,0.2,0.5,0.8,0.95,0.99,1]
# Create the binned variable
df1[binned_var] = pd.qcut(df1[var], percentile, duplicates='drop')
# Create the bin labels
df1['bin_label'] = df1[binned_var].apply(lambda x: f'{x.left:.2f} - {x.right:.2f}')
# Count the occurrences in each bin
bin_counts = df1['bin_label'].value_counts().sort_index().reset_index()
bin_counts.columns = ['bin_label', 'count']
# Create the plot
fig = px.bar(bin_counts, x='bin_label', y='count', labels={'bin_label': 'Quantile Range', 'count': 'Count'},
color=bin_counts['bin_label'], color_discrete_sequence=px.colors.sequential.Plotly3)
# Update the layout
fig.update_layout(title='Distribution of Average Vendor in last 90 days Ratio', xaxis_title='Quantile Range', yaxis_title='Count', template = 'simple_white'
)
# Show the plot
fig.show()
Data Preparation for Modeling¶
# Extracting only the ratio columns
ratio_columns = [col for col in df1.columns if col.endswith('Ratio')]
X = df1[ratio_columns]
X.head()
| Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.081219 | 1.741068 | 1.929814 | 2.279633 | 0.404545 | 0.263002 | 0.190098 | 0.050388 | 0.719244 | 0.354762 | 0.190927 | 1.243789 | 0.000000 | 0.000000 | 0.000000 |
| 1 | 0.974461 | 0.564751 | 0.715171 | 0.562686 | 0.208580 | 0.208580 | 0.057642 | 0.290082 | 4.375400 | 2.019416 | 1.280605 | 0.781845 | 0.000000 | 1.257404 | 0.281362 |
| 2 | 0.422515 | 0.378935 | 0.394945 | 0.404276 | 0.033164 | 0.033164 | 0.033164 | 0.024455 | 0.312529 | 0.157200 | 0.104383 | 1.801787 | 0.000000 | -6.842170 | 1.974195 |
| 3 | 1.611007 | 1.704157 | 1.614249 | 1.554416 | 0.117421 | 0.117421 | 0.115509 | 0.006480 | 0.081656 | 0.040326 | 0.024466 | 1.544294 | 1.578127 | 1.714738 | 1.609596 |
| 4 | 0.464432 | 0.330686 | 0.422823 | 0.436213 | 0.026625 | 0.026625 | 0.026625 | 0.017822 | 0.256121 | 0.125011 | 0.078016 | 0.876658 | 0.663009 | 0.747109 | 0.876767 |
# Splitting data into train and test sets for X (features only)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=123)
X_train_norm, X_test_norm = standardizer(X_train, X_test)
Section 4: Histogram-Based Outlier Score (HBOS)¶
Approach:
- HBOS uses histograms to model the distribution of each feature independently. It divides the range of each feature into bins and computes the density of data points in each bin.
Scoring:
- The anomaly score is based on the inverse of the density; regions with lower density (fewer data points) are considered more anomalous.
Advantages:
- Simple and fast, particularly effective for univariate data or when features are independent.
Limitations:
- May not capture interactions between features effectively. Sensitive to bin width and outliers.
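To make the scoring concrete, here is a minimal NumPy sketch of the per-feature histogram idea. This is a simplification of pyod's HBOS (which additionally smooths and caps densities); `hbos_score`, the bin count, and the planted outlier below are illustrative choices, not part of the assignment or the library API.

```python
import numpy as np

def hbos_score(X, n_bins=10, eps=1e-12):
    """Simplified HBOS: per-feature histogram densities, combined as a
    sum of negative log-densities (lower density => higher score)."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        col = X[:, j]
        density, edges = np.histogram(col, bins=n_bins, density=True)
        # Map each value to its bin; interior edges give indices 0..n_bins-1
        idx = np.digitize(col, edges[1:-1])
        scores += -np.log(density[idx] + eps)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [8.0, -8.0, 8.0]   # plant an obvious outlier in all three features
s = hbos_score(X)
print(s.argmax())          # -> 0: the planted point lands in sparse bins everywhere
```

Because each feature is scored against its own histogram, the method stays fast and scale-free, but (as noted above) a point that is only unusual in a *combination* of features can still land in dense bins of every individual histogram.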
# Define functions
def count_stat(vector):
    unique, counts = np.unique(vector, return_counts=True)
    return dict(zip(unique, counts))

def descriptive_stat_threshold(df, pred_scores, threshold):
    df = pd.DataFrame(df)
    df['Anomaly_Score'] = pred_scores
    df['Group'] = np.where(df['Anomaly_Score'] < threshold, 'Normal', 'Outlier')
    # Calculate count and percentage
    cnt = df.groupby('Group')['Anomaly_Score'].count().reset_index().rename(columns={'Anomaly_Score': 'Count'})
    cnt['Count %'] = (cnt['Count'] / cnt['Count'].sum()) * 100
    # Calculate mean statistics
    stat = df.groupby('Group').mean().round(2).reset_index()
    # Merge count and mean statistics
    stat = cnt.merge(stat, on='Group')
    return stat
# Initialize parameters for hyperparameter tuning
param_grid = {'n_bins': [300, 600, 700], 'contamination': [0.01, 0.02, 0.03, 0.04, 0.05], 'alpha': [0.1, 0.2, 0.5, 1.0]}
4.1 HBOS¶
# Initialize HBOS model
hbos = HBOS()
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(estimator=hbos, param_grid=param_grid, scoring='roc_auc')
grid_search.fit(X_train_norm)
# Get the best hyperparameters
best_params = grid_search.best_params_
# Use the best hyperparameters to train the final HBOS model
best_hbos = HBOS(**best_params)
best_hbos.fit(X_train_norm)
# Get the anomaly scores and predictions
y_train_scores = best_hbos.decision_function(X_train_norm)
y_train_pred = best_hbos.predict(X_train_norm)
y_test_scores = best_hbos.decision_function(X_test_norm)
y_test_pred = best_hbos.predict(X_test_norm)
# Print threshold and count statistics
print(f"Best hyperparameters for HBOS: {best_params}")
threshold = best_hbos.threshold_
print(f"The threshold for HBOS: {threshold}")
print(f"The training data (counts) for HBOS: {count_stat(y_train_pred)}")
print(f"The test data (counts) for HBOS: {count_stat(y_test_pred)}", '\n')
Best hyperparameters for HBOS: {'alpha': 0.1, 'contamination': 0.01, 'n_bins': 300}
The threshold for HBOS: 16.35766384715818
The training data (counts) for HBOS: {0: 350458, 1: 3508}
The test data (counts) for HBOS: {0: 87620, 1: 872}
# Function to plot histograms with threshold annotation and color differentiation
def plot_histogram_with_threshold(scores, title, threshold):
    fig = go.Figure()
    # Split the scores into two parts: below the threshold and above/equal to the threshold
    below_threshold = scores[scores < threshold]
    above_threshold = scores[scores >= threshold]
    # Add histogram traces for the two parts
    fig.add_trace(go.Histogram(x=below_threshold, name='Normal', marker=dict(color='#2b7574')))
    fig.add_trace(go.Histogram(x=above_threshold, name='Outliers', marker=dict(color='#c83a21')))
    # Add a vertical line annotating the threshold
    fig.add_vline(x=threshold, line_dash="dash", line_color="#0d5087", annotation_text="Threshold",
                  annotation_position="top right", annotation=dict(font=dict(color="#27496d", size=12)))
    # Update layout
    fig.update_layout(
        title=title,
        xaxis_title="Anomaly Score",
        yaxis_title="Frequency",
        showlegend=True,
        height=450,
        width=800,
        template='plotly_white',
        paper_bgcolor='rgba(0,0,0,0)',
        plot_bgcolor='rgba(0,0,0,0)'
    )
    # Show the plot
    fig.show()
# Plot the anomaly score distribution for the HBOS test data
plot_histogram_with_threshold(np.array(y_test_scores), "Anomaly Scores for HBOS Test Data", threshold)
# Get the descriptive statistics for outliers
stats_train_hbos = descriptive_stat_threshold(X_train, y_train_scores, threshold)
stats_test_hbos = descriptive_stat_threshold(X_test, y_test_scores, threshold)
stats_train_hbos
| Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 350383 | 98.987756 | 1.01 | 1.13 | 2.293682e+12 | 2.293682e+12 | 0.31 | 0.24 | 0.21 | 0.96 | 4.21 | 2.87 | 2.21 | 2.446502e+11 | 0.81 | 0.81 | 0.88 | -0.01 |
| 1 | Outlier | 3583 | 1.012244 | 1.09 | 105.68 | 1.335400e+02 | 1.066500e+02 | 30.40 | 27.17 | 21.42 | 4.53 | 26.23 | 19.23 | 14.96 | -1.510915e+13 | 1.29 | 1.30 | 1.50 | 17.31 |
stats_test_hbos
| Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 87603 | 98.995389 | 0.95 | 1.09 | -5.320852e+12 | -5.320852e+12 | 0.28 | 0.23 | 0.2 | 0.97 | 4.20 | 2.82 | 2.21 | -3.605491e+11 | 0.82 | 0.81 | 0.90 | 0.01 |
| 1 | Outlier | 889 | 1.004611 | 1.29 | 319.57 | 4.121900e+02 | 3.882400e+02 | 73.35 | 75.81 | 74.9 | 4.60 | 26.54 | 19.56 | 15.06 | 1.000000e+00 | 1.11 | 1.01 | 0.92 | 17.29 |
Key Takeaways
The HBOS model flagged a total of 4,472 outliers (3,583 in train, 889 in test), roughly 1% of each dataset. The mean anomaly score differs sharply between the Normal and Outlier groups (about 0 versus roughly 17.3).
4.2 Combination by Average ¶
# Test a range of bin counts
k_list = [50, 100, 150, 200, 300, 350, 400, 600, 700, 800]
n_clf = len(k_list)
# Prepare arrays to store the scores from each classifier
train_scores = np.zeros([X_train.shape[0], n_clf])
test_scores = np.zeros([X_test.shape[0], n_clf])
# Modeling: fit one HBOS model per bin count
for i in range(n_clf):
    k = k_list[i]
    hbos_avg = HBOS(n_bins=k)
    hbos_avg.fit(X_train_norm)
    # Store the results in each column:
    train_scores[:, i] = hbos_avg.decision_function(X_train_norm)
    test_scores[:, i] = hbos_avg.decision_function(X_test_norm)
# Decision scores have to be normalized before combination
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
# Combination by average
y_train_by_average = average(train_scores_norm)
y_test_by_average = average(test_scores_norm)
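Under the hood, pyod's standardizer() z-scores each detector's score column (via sklearn's StandardScaler) and average() takes the row-wise mean across detectors. The pure-NumPy toy below illustrates the same combination; the score matrix is made up for the example, not taken from the assignment data.

```python
import numpy as np

# Three hypothetical "detectors" score five points on very different scales
scores = np.array([[1.0, 10.0, 100.0],
                   [2.0, 20.0, 200.0],
                   [3.0, 30.0, 300.0],
                   [4.0, 40.0, 400.0],
                   [9.0, 90.0, 900.0]])  # the last point is extreme for all three

# Per-column z-score (what standardizer() does), then row-wise mean (average())
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
combined = z.mean(axis=1)
print(combined.argmax())                 # -> 4: the extreme point ranks highest
```

Standardizing first matters: without it, the detector with the largest raw score range would dominate the average.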
plt.hist(y_train_by_average, bins='auto') # arguments are passed to np.histogram
plt.title("Combination by average")
plt.show()
Based on the histogram above, I'll pick 2.1 as the threshold for identifying outliers.
plot_histogram_with_threshold(np.array(y_test_by_average), "Anomaly Scores for HBOS Combination by Average", 2.1)
# Get the descriptive statistics for outliers
stats_train_hbos_avg = descriptive_stat_threshold(X_train, y_train_by_average, 2.1)
stats_test_hbos_avg = descriptive_stat_threshold(X_test, y_test_by_average, 2.1)
stats_train_hbos_avg
| Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 350105 | 98.909217 | 1.01 | 1.13 | 2.295504e+12 | 2.295504e+12 | 0.31 | 0.24 | 0.21 | 0.96 | 4.19 | 2.86 | 2.20 | 2.448445e+11 | 0.81 | 0.81 | 0.88 | -0.03 |
| 1 | Outlier | 3861 | 1.090783 | 1.10 | 98.09 | 1.242700e+02 | 9.925000e+01 | 28.00 | 25.30 | 19.94 | 4.34 | 26.08 | 18.78 | 14.62 | -1.402126e+13 | 1.33 | 1.32 | 1.50 | 2.27 |
stats_test_hbos_avg
| Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 87533 | 98.916286 | 0.95 | 1.13 | -5.325107e+12 | -5.325107e+12 | 0.32 | 0.23 | 0.20 | 0.97 | 4.18 | 2.81 | 2.20 | -3.608375e+11 | 0.82 | 0.81 | 0.90 | -0.02 |
| 1 | Outlier | 959 | 1.083714 | 1.27 | 292.78 | 3.825800e+02 | 3.603300e+02 | 64.11 | 70.39 | 69.49 | 4.39 | 26.43 | 19.12 | 14.71 | 9.500000e-01 | 1.35 | 1.24 | 1.15 | 2.27 |
Key Takeaways
A total of 4,820 outliers (3,861 train, 959 test) were captured by the HBOS model, roughly 1.1% of the train and test datasets respectively. The mean anomaly score differs sharply between the groups (-0.03 for normal vs. 2.27 for outliers on the training set).
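For reference, the "combination by average" step used above can be sketched in plain NumPy. This is a simplified stand-in for PyOD's `standardizer` and `average` helpers, operating on a hypothetical matrix of scores from several detectors:

```python
import numpy as np

# Sketch of "combination by average": z-score each detector's column of
# anomaly scores (zero mean, unit variance), then take the per-sample mean.
def combine_by_average(score_matrix):
    z = (score_matrix - score_matrix.mean(axis=0)) / score_matrix.std(axis=0)
    return z.mean(axis=1)

rng = np.random.default_rng(2)
scores = rng.gamma(2.0, 1.0, size=(1000, 3))  # 3 hypothetical detectors
combined = combine_by_average(scores)
print(combined.shape)  # (1000,)
```

Standardizing first matters: without it, a detector whose raw scores happen to have a larger scale would dominate the average.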
Section 5: Empirical Cumulative Distribution-based Outlier Detection (ECOD) ¶
Approach:
- ECOD uses the empirical cumulative distribution function (ECDF) to model the data. It considers the rank of each data point in the cumulative distribution.
Scoring:
- The anomaly score is based on the empirical distribution of the data. Points far from the cumulative distribution curve are considered more anomalous.
Advantages:
- Non-parametric and can handle non-linear relationships and interactions between features.
Limitations:
- Can be computationally intensive for large datasets. Sensitive to the empirical distribution.
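To make the scoring idea concrete, here is a minimal NumPy sketch of ECOD-style scoring. It is a simplified two-sided variant (the actual PyOD implementation also applies a skewness correction per feature): each point's per-feature tail probability comes from the empirical CDF, and negative log tail probabilities are summed across features.

```python
import numpy as np

def ecod_scores(X):
    """Toy ECOD-style scores: per-feature negative log tail probabilities
    from the empirical CDF, summed across features (simplified sketch)."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        col = X[:, j]
        srt = np.sort(col)
        # Empirical left/right tail probabilities via ranks
        left = np.searchsorted(srt, col, side="right") / n
        right = 1.0 - np.searchsorted(srt, col, side="left") / n
        # Clip to avoid log(0); a point deep in either tail scores high
        tail = np.clip(np.minimum(left, right), 1.0 / n, None)
        scores += -np.log(tail)
    return scores

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
X[0] = [8.0, 8.0, 8.0]  # planted outlier
s = ecod_scores(X)
print(s[0] > np.percentile(s, 99))  # the planted outlier scores in the top 1%
```

Because everything is rank-based, no distributional assumptions are needed, which is what makes ECOD non-parametric.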
ecod = ECOD(contamination=0.01)
ecod.fit(X_train_norm)
# Training data
y_train_scores_ecod = ecod.decision_function(X_train_norm)
y_train_pred_ecod = ecod.predict(X_train_norm)
# Test data
y_test_scores_ecod = ecod.decision_function(X_test_norm)
y_test_pred_ecod = ecod.predict(X_test_norm)
# Print threshold and count statistics
threshold_ecod = ecod.threshold_
print(f"The threshold for ECOD: {threshold_ecod}")
print(f"The training data (counts) for ECOD: {count_stat(y_train_pred_ecod)}")
print(f"The test data (counts) for ECOD: {count_stat(y_test_pred_ecod)}", '\n')
The threshold for ECOD: 56.87318616811985
The training data (counts) for ECOD: {0: 350426, 1: 3540}
The test data (counts) for ECOD: {0: 87581, 1: 911}
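The `contamination=0.01` setting is what produces the `threshold_` printed above: in PyOD's convention, the threshold is set so that the top 1% of training scores are labeled outliers. A small sketch with synthetic scores (the gamma distribution here is an arbitrary assumption for illustration):

```python
import numpy as np

# Sketch of how contamination maps to a score threshold (PyOD convention):
# with contamination=0.01, the threshold is the 99th percentile of the
# training scores, so ~1% of training points are labeled outliers.
rng = np.random.default_rng(1)
train_scores = rng.gamma(shape=2.0, scale=10.0, size=100_000)
contamination = 0.01
threshold = np.percentile(train_scores, 100 * (1 - contamination))
labels = (train_scores > threshold).astype(int)
print(f"threshold={threshold:.2f}, outliers={labels.sum()}")  # flags ~1% of points
```

This explains why both the training and test outlier counts above land near 1% of their respective datasets.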
plot_histogram_with_threshold(np.array(y_test_scores_ecod), "Anomaly Scores for ECOD", threshold_ecod)
# Get the descriptive statistics for outliers
stats_train_ecod_avg = descriptive_stat_threshold(X_train, y_train_scores_ecod, threshold_ecod)
stats_test_ecod_avg = descriptive_stat_threshold(X_test, y_test_scores_ecod, threshold_ecod)
stats_train_ecod_avg
| | Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 350426 | 98.999904 | 0.98 | 1.09 | 1.122860e+13 | 1.122860e+13 | 0.31 | 0.22 | 0.19 | 0.99 | 4.35 | 2.98 | 2.30 | 4.141689e+08 | 0.75 | 0.77 | 0.84 | 23.60 |
| 1 | Outlier | 3540 | 1.000096 | 3.92 | 111.12 | -8.844993e+14 | -8.844993e+14 | 31.15 | 29.61 | 24.03 | 1.41 | 12.36 | 8.59 | 5.95 | 8.881370e+12 | 6.83 | 4.86 | 5.37 | 64.19 |
stats_test_ecod_avg
| | Group | Count | Count % | Average_Merchant_amount_Ratio | Average_Merchant_amount_30days_Ratio | Average_Merchant_amount_60days_Ratio | Average_Merchant_amount_90days_Ratio | Max_Merchant_amount_30days_Ratio | Max_Merchant_amount_60days_Ratio | Max_Merchant_amount_90days_Ratio | Avg_Transaction_Ratio | Avg_Transaction_30days_Ratio | Avg_Transaction_60days_Ratio | Avg_Transaction_90days_Ratio | Average_Vendor_amount_Ratio | Average_Vendor_amount_30days_Ratio | Average_Vendor_amount_60days_Ratio | Average_Vendor_amount_90days_Ratio | Anomaly_Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Normal | 87581 | 98.970528 | 0.93 | 1.10 | 6.754294e+12 | 6.754294e+12 | 0.33 | 0.21 | 0.18 | 1.00 | 4.35 | 2.93 | 2.31 | 9.500000e-01 | 0.77 | 0.77 | 0.85 | 23.60 |
| 1 | Outlier | 911 | 1.029472 | 3.43 | 311.04 | -1.160999e+15 | -1.160999e+15 | 66.29 | 76.39 | 74.97 | 1.27 | 11.45 | 8.29 | 5.29 | -3.467090e+13 | 6.00 | 5.21 | 5.68 | 64.32 |
Key Takeaways
A total of 4,451 outliers (3,540 train, 911 test) were captured by the ECOD model, roughly 1% of the train and test datasets respectively. The mean anomaly score differs sharply between the groups (about 23.6 for normal vs. 64 for outliers).
Section 6: Models Predictions Comparison¶
# Compare HBOS and ECOD predictions
comparison_df = pd.DataFrame({'HBOS_pred': y_test_pred, 'ECOD_pred': y_test_pred_ecod})
print(pd.crosstab(comparison_df['HBOS_pred'], comparison_df['ECOD_pred']))
ECOD_pred      0    1
HBOS_pred
0          84085    0
1           3496  911
# Cross-tabulate the two models' predictions (an agreement matrix: there is
# no ground truth here, so this is not a true confusion matrix)
cm = pd.crosstab(y_test_pred, y_test_pred_ecod)
# Plot the agreement matrix as a heatmap; crosstab rows (HBOS) map to the
# y axis and columns (ECOD) map to the x axis
trace2 = go.Heatmap(z=cm.values,
x=['ECOD 0', 'ECOD 1'],
y=['HBOS 0', 'HBOS 1'],
showscale=False,
colorscale=[
[0.0, "#c83a21"],
[0.02, "#2b7574"],
[1.0, "#ffffff"]
],
xgap=20,
ygap=20,
text=cm.values,
texttemplate="%{text}")
# Define layout
layout = go.Layout(
title=dict(text="Prediction Agreement between HBOS and ECOD", x=0.5, y=0.9, xanchor='center', yanchor='top'),
xaxis=dict(title='ECOD Pred', showticklabels=True),
yaxis=dict(title='HBOS Pred', showticklabels=True),
autosize=False,
width=500,
height=500,
plot_bgcolor='rgba(0,0,0,0)',
paper_bgcolor='rgba(0,0,0,0)'
)
# Plot heatmap
fig = go.Figure(data=[trace2], layout=layout)
fig.show()
Key Takeaways from Crosstab:
Agree on Normal:
- Both HBOS and ECOD predicted 84,085 instances as normal (0)
Agree on Outliers:
- Both HBOS and ECOD predicted 911 instances as outliers (1)
Discrepancy in Predictions:
- There are 0 instances where HBOS predicted normal (0), but ECOD predicted outliers (1)
- There are 3,496 instances where HBOS predicted outliers (1), but ECOD predicted normal (0). This indicates that HBOS is more sensitive, potentially identifying more outliers but including some that ECOD considers normal.
The high count (84,085) of instances predicted as normal by both models indicates a strong agreement on what constitutes normal behavior.
Both models agreed on 911 instances being outliers. This agreement suggests that these instances are likely true outliers, as both models identified them as such despite their different methodologies.
ECOD is stricter here, flagging fewer points: this yields fewer false positives but risks missing outliers that HBOS catches (more false negatives). HBOS is more sensitive, flagging more points, including some that ECOD considers normal (more false positives).
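The agreement described above can be summarized numerically. A small sketch recomputing two common agreement metrics directly from the four crosstab cells:

```python
# Agreement metrics from the four crosstab cells reported above
both_normal, hbos_only, ecod_only, both_outlier = 84085, 3496, 0, 911
total = both_normal + hbos_only + ecod_only + both_outlier

# Fraction of test points where the two models give the same label
overall_agreement = (both_normal + both_outlier) / total
# Jaccard similarity on the outlier class: intersection over union of flags
jaccard = both_outlier / (both_outlier + hbos_only + ecod_only)
print(f"agreement={overall_agreement:.4f}, outlier Jaccard={jaccard:.4f}")
# agreement=0.9605, outlier Jaccard=0.2067
```

The high overall agreement is driven by the dominant normal class; the much lower outlier-class Jaccard makes the models' disagreement on what counts as an anomaly explicit.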
Section 7: Conclusion¶
Detection Method Comparison:
HBOS produces scores tightly clustered over a narrow range, while ECOD's scores span a much wider range, giving more granularity when ranking outliers. In terms of counts, HBOS flagged more test-set outliers (4,407 per the Section 6 crosstab) than ECOD (911).
Outlier Locations:
Outliers are flagged at high scores (above the 2.1 threshold for HBOS and above roughly 57 for ECOD) and sit in the right tail of the score distribution in both the training and test datasets.
In the context of credit card transaction anomaly detection, both models have their strengths and weaknesses. HBOS is more aggressive in flagging anomalies here, which might be better for catching more fraudulent transactions at the cost of higher false positives. ECOD, being more conservative, might result in fewer disruptions to legitimate transactions but may miss some anomalies. Depending on the specific goals and risk tolerance of the anomaly detection system, either model, or a combination of both, could be employed to enhance the overall detection capability.